TD(λ) Networks: Temporal-Difference Networks with Eligibility Traces
Abstract
Temporal-difference (TD) networks have been introduced as a formalism for expressing and learning grounded world knowledge in a predictive form (Sutton & Tanner, 2005). Like conventional TD(0) methods, the learning algorithm for TD networks uses 1-step backups to train prediction units about future events. In conventional TD learning, the TD(λ) algorithm is often used to perform more general multi-step backups of future predictions. In our work, we introduce a generalization of the 1-step TD network specification that is based on the TD(λ) learning algorithm, creating TD(λ) networks. We present experimental results showing that TD(λ) networks can learn solutions in more complex environments than TD networks. We also show that, in problems that can be solved by TD networks, TD(λ) networks generally learn solutions much faster than their 1-step counterparts. Finally, we present an analysis of our algorithm showing that the computational cost of TD(λ) networks is only slightly greater than that of TD networks.

Temporal-difference (TD) networks are a formalism for expressing and learning grounded knowledge about dynamical systems (Sutton & Tanner, 2005). TD networks are one approach to learning a predictive representation of state (Littman et al., 2002; Jaeger, 1998; Rivest & Schapire, 1990). Predictive representations are one of several approaches to learning generative models of dynamical systems that can be used for planning and/or reinforcement learning. Predictive representations can be learned from data (Wolfe et al., 2005) and have been shown to have richer representational power than competing approaches such as POMDPs and n-th order Markov models (Singh et al., 2004).

Predictive representations represent the state of a dynamical system as a vector of predictions about future action–observation sequences. The hypothesis that important knowledge about the world can be represented strictly in terms of predictions of relationships between observable quantities is a novel idea that distinguishes predictive representations from other approaches. In predictive state representations (PSRs), introduced by Littman et al. (2002), each prediction is an estimate of the probability of some sequence of observations given a sequence of actions. TD networks generalize PSRs: in TD networks, each prediction is an estimate of the probability or expected value of some function of future predictions and observations. The predictions are thought of as “answers” to a set of “questions” represented within the TD network.

The essential idea of TD learning can be described as learning a guess from a guess. Before TD networks, the two guesses involved were predictions of the same quantity at two points in time, for example, the discounted future reward at successive time steps. TD networks generalize TD methods by allowing the second guess to be different from the first. The current TD network learning algorithm uses 1-step backups; the target for a prediction comes from the subsequent time step. In conventional TD learning, the TD(λ) algorithm is often used to perform more general, n-step backups. Rather than a single future prediction, n-step backups use a weighted average of future predictions as the target for learning (Sutton, 1988). TD(λ) is really a spectrum of algorithms, controlled by the continuous-valued parameter λ ∈ [0, 1]. TD(λ=0) uses the earliest possible prediction as the target for learning, while TD(λ=1) uses the latest possible prediction as the target. For other values of λ, the targets for learning are distributed among all of the predictions along the way.
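For concreteness, the following is the standard forward-view λ-return and the equivalent backward-view update with eligibility traces for conventional TD(λ) reward prediction (following Sutton, 1988); it is background only, not the TD-network update introduced in this paper. The symbols r_t (reward), V (value estimate), γ (discount factor), α (step size), δ_t (TD error), and e_t (eligibility trace) are standard notation assumed here rather than taken from the excerpt:

    \begin{align*}
    G_t^{(n)}     &= r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n}) \\
    G_t^{\lambda} &= (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} \\
    \delta_t      &= r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \\
    e_t(s)        &= \gamma \lambda \, e_{t-1}(s) + \mathbf{1}[s = s_t] \\
    V(s)          &\leftarrow V(s) + \alpha \, \delta_t \, e_t(s) \quad \text{for all } s
    \end{align*}

In the backward view, the trace e_t distributes each TD error over recently visited states, which is what makes the multi-step backup implementable online.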
A worthwhile contribution to the work on predictive representations would be a least-squares algorithm for learning TD networks, similar to LSTD (Boyan, 2002). We feel that TD networks are still not well understood and that the TD(λ) algorithm remains of widespread interest, leaving LSTD networks an interesting possibility for future work.

In this work, we present an extension to the TD network learning algorithm that uses n-step backups of future predictions and observations. We call our approach TD(λ) networks. We show that TD(λ) networks generalize the previous approach: the original TD network specification is identical to a TD(λ=0) network. From this point forward, we refer to both 1-step and n-step TD networks simply as TD networks. When the distinction is important, we refer to the previous specification as TD(0) networks and to our new specification as TD(λ) networks.

In Section 1 we review the TD(0) network specification. We present and discuss the new learning algorithm in Sections 2 and 3, respectively. In Section 4 we compare the performance of TD(λ) networks for various values of λ, including TD(λ=0) networks. In Section 5 we include a detailed cost analysis of our implementation of the algorithm. Finally, we conclude and discuss our results in Section 6.
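The update rules for TD(λ) networks themselves are presented later in the paper and are not reproduced in this excerpt. As background for the n-step backups discussed above, the sketch below implements conventional tabular TD(λ) prediction with accumulating eligibility traces; the environment interface (env.reset, env.step), the policy function, and all parameter values are illustrative assumptions rather than anything from the paper.

    # Minimal sketch of conventional tabular TD(lambda) prediction with
    # accumulating eligibility traces (Sutton, 1988).  This is background for
    # the n-step backups discussed above, not the TD-network-specific update
    # from the paper.  The `env` interface and all constants are assumptions.
    from collections import defaultdict

    def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=0.9, lam=0.8):
        """Estimate the state-value function V under `policy` with TD(lambda)."""
        V = defaultdict(float)           # value estimates, one per state
        for _ in range(num_episodes):
            e = defaultdict(float)       # eligibility traces, reset each episode
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)

                # 1-step TD error: the "guess from a guess"
                target = reward + (0.0 if done else gamma * V[next_state])
                delta = target - V[state]

                # accumulating trace: mark the current state as eligible
                e[state] += 1.0

                # credit the TD error to every recently visited state,
                # weighted by its decaying eligibility
                for s in list(e):
                    V[s] += alpha * delta * e[s]
                    e[s] *= gamma * lam
                state = next_state
        return V

With lam = 0 the traces are cleared after every update and only the current state is adjusted, reducing the algorithm to a 1-step TD(0) backup; this mirrors the relationship between TD(0) networks and TD(λ) networks described above.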
Similar Resources
Experimental analysis of eligibility traces strategies in temporal difference learning
Temporal difference (TD) learning is a model-free reinforcement learning technique, which adopts an infinite horizon discount model and uses an incremental learning technique for dynamic programming. The state value function is updated in terms of sample episodes. Utilising eligibility traces is a key mechanism in enhancing the rate of convergence. TD(λ) represents the use of eligibility traces...
Double Q(σ) and Q(σ, λ): Unifying Reinforcement Learning Control Algorithms
Temporal-difference (TD) learning is an important field in reinforcement learning. Sarsa and Q-Learning are among the most used TD algorithms. The Q(σ) algorithm (Sutton & Barto, 2017) unifies both. This paper extends the Q(σ) algorithm to an online multi-step algorithm Q(σ, λ) using eligibility traces and introduces Double Q(σ) as the extension of Q(σ) to double learning. Experiments sugges...
Kernel Least-Squares Temporal Difference Learning
Kernel methods have attracted much research interest recently, since by utilizing Mercer kernels, non-linear and non-parametric versions of conventional supervised or unsupervised learning algorithms can be implemented and usually better generalization abilities can be obtained. However, kernel methods in reinforcement learning have not been popularly studied in the literature. In this paper, w...
An Introduction to Temporal Difference Learning
Temporal Difference learning is one of the most widely used approaches for policy evaluation. It is a central part of solving reinforcement learning tasks. To derive optimal control, policies have to be evaluated; this task requires value function approximation, which is where TD methods find application. The use of eligibility traces for backpropagation of updates as well as the bootstrapping of th...
Temporal Abstraction in TD Networks
Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment (Sutton & Tanner, 2005). These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that they are about what would happen if an action or sequence of a...